
Sync NCAR/main branch of CCPP physics with ufs/dev branch#3190

Merged
gspetro-NOAA merged 21 commits into ufs-community:develop from grantfirl:NCAR-main-sync-20260401 on May 11, 2026

Conversation

@grantfirl
Collaborator

@grantfirl grantfirl commented Apr 10, 2026

Commit Queue Requirements:

  • This PR addresses a relevant WM issue (if not, create an issue).
  • All subcomponent pull requests (if any) have been reviewed by their code managers.
  • Run the full Intel+GNU RT suite (compared to current baselines), preferably on Ursa (Derecho or Hercules are acceptable alternatives). Exceptions: documentation-only PRs, CI-only PRs, etc.
    • Commit log file w/full results from RT suite run (if applicable).
    • Verify that test_changes.list indicates which tests, if any, are changed by this PR. Commit test_changes.list, even if it is empty.
  • Fill out all sections of this template.

Description:

This is primarily just a CCPP physics update, although it contains some host-side metadata changes to work with the ccpp-physics changes. See ufs-community/ccpp-physics#369 for a description of the CCPP-physics changes.

Commit Message:

* UFSWM - Sync NCAR/main branch of CCPP physics with ufs/dev branch 
  * UFSATM - Sync NCAR/main branch of CCPP physics with ufs/dev branch
    * ccpp-physics - Sync NCAR/main branch of CCPP physics with ufs/dev branch

Priority:

  • Critical Bugfix: Reason
  • High: Reason
  • Normal

Git Tracking

UFSWM:

  • N/A

Sub component Pull Requests:

UFSWM Blocking Dependencies:

  • Blocked by #
  • None

Documentation:

  • Documentation update required.
    • Relevant updates are included with this PR.
    • A WM issue has been opened to track the need for a documentation update; a person responsible for submitting the update has been assigned to the issue (link issue).
  • Documentation update NOT required.
    • Explanation: This just updates CCPP physics and doesn't change any results. It's possible that documentation changes should be done for the CCPP technical documentation related to the new ccpp_bcast capability.

Changes

Regression Test Changes (Please commit test_changes.list):

  • PR Adds New Tests/Baselines.
  • PR Updates/Changes Baselines.
  • No Baseline Changes.

Input data Changes:

  • None.
  • PR adds input data.
  • PR changes existing input data.

Library Changes/Upgrades:

  • Required
    • Library names w/versions:
    • Git Stack Issue (JCSDA/spack-stack#)
  • No Updates
    This does allow CCPP to use the IP library rather than the SP library, which is a prerequisite for using spack-stack 2.0+.

Testing Log:

  • RDHPCS
    • Orion
    • Hercules
    • GaeaC6
    • Derecho
    • Ursa
  • WCOSS2
    • Dogwood/Cactus
    • Acorn
  • CI
  • opnReqTest (complete task if unnecessary)

@grantfirl
Collaborator Author

There are several timeouts in my Ursa RT run. I think this may be related to our resources being overused and fairshare throttling our runs as a result.

@dpsarmie
Collaborator

I'll rerun the tests on Ursa as a secondary check.

Also, a new GNU warning was introduced in this PR:

/scratch4/NCEPDEV/stmp/Daniel.Sarmiento/grantfirl/UFSATM/ccpp/physics/physics/tools/mpiutil.F90:239:0:

  239 | #else if defined(__GFORTRAN__)
      |
Warning: extra tokens at end of #else directive

Can we get that fixed as part of this PR? It should just be an else if --> elif change.
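
For reference, a minimal sketch of the directive change (the surrounding branches are placeholders, not copied from mpiutil.F90):

   ! before (gfortran: "Warning: extra tokens at end of #else directive"):
#else if defined(__GFORTRAN__)

   ! after (the suggested fix):
#elif defined(__GFORTRAN__)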

@grantfirl
Collaborator Author

> I'll rerun the tests on Ursa as a secondary check.
>
> Also, a new GNU warning was introduced in this PR:
>
> /scratch4/NCEPDEV/stmp/Daniel.Sarmiento/grantfirl/UFSATM/ccpp/physics/physics/tools/mpiutil.F90:239:0:
>
>   239 | #else if defined(__GFORTRAN__)
>       |
> Warning: extra tokens at end of #else directive
>
> Can we get that fixed as part of this PR? It should just be an else if --> elif change.

Sure, I'll make that change and update the PR.

@dpsarmie
Collaborator

It looks like the timeouts on the HAFS tests are real, unfortunately. I repeated one of the HAFS tests to make sure and it does seem to hang early on in the run. Every other test passed.

My first guess is that something in the CCPP updates is conflicting with the nests. This will need to be resolved before bringing in these changes.

@grantfirl
Collaborator Author

> It looks like the timeouts on the HAFS tests are real, unfortunately. I repeated one of the HAFS tests to make sure and it does seem to hang early on in the run. Every other test passed.
>
> My first guess is that something in the CCPP updates is conflicting with the nests. This will need to be resolved before bringing in these changes.

@dpsarmie Thanks very much for rerunning those. I'll see if I can debug the HAFS tests that are timing out. I'm guessing that it's another issue with the new ccpp_bcast functionality, perhaps triggered by the nests.

@gspetro-NOAA added the "No Baseline Change", "UFSATM", and "CCPP" labels on Apr 14, 2026
@gspetro-NOAA moved this from Evaluating to Waiting for Reviews (subcomponent) in PRs to Process on Apr 14, 2026
@gspetro-NOAA
Collaborator

@grantfirl Just wanted to confirm that this PR is ready for review? If so, I'll request UFSATM reviews.

@grantfirl
Collaborator Author

> @grantfirl Just wanted to confirm that this PR is ready for review? If so, I'll request UFSATM reviews.

No, there are still problems with the nested tests. I'll probably need help debugging these. I'm going to ask Dom, who made most of the changes related to mpi_bcast, to help.

@grantfirl
Collaborator Author

@jkbk2004 @gspetro-NOAA Do you know of anyone who could help debug this nesting problem? I described what I'm seeing for @climbfuji here: ufs-community/ccpp-physics#369 (comment)

@gspetro-NOAA
Collaborator

> @jkbk2004 @gspetro-NOAA Do you know of anyone who could help debug this nesting problem? I described what I'm seeing for @climbfuji here: ufs-community/ccpp-physics#369 (comment)

Perhaps @DusanJovic-NOAA would have some idea since it's happening in UFSATM? That said, @NickSzapiro-NOAA suggested that you could potentially revert some of the CCPP changes in your PR (like NCAR/ccpp-physics#1187) to isolate where the failures might be coming from.

@grantfirl
Collaborator Author

@DusanJovic-NOAA Do you know/remember what happens with the MPI root PE used in a nested situation? It seems as though there is a conflict with the MPI root PE when nesting is turned on. The MPI_bcast calls as part of this PR are trying to use the mpi_root value given to physics (which is set from FV3's mpp_root_pe() function) rather than a hard-coded 0 value, which is used in various places throughout the UFSATM code. I'm wondering if this is causing the conflict/hanging behavior noticed in this PR?

Any insights are appreciated!
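
For context, a minimal standalone sketch (hypothetical, not UFSATM code) of the MPI rule at play: every rank in a communicator must pass the same root to MPI_Bcast, otherwise the collective never matches and the run hangs.

   program bcast_root_demo
     use mpi
     implicit none
     integer :: ierr, me, root, val

     call MPI_Init(ierr)
     call MPI_Comm_rank(MPI_COMM_WORLD, me, ierr)

     ! one agreed-upon root for every task in the communicator
     root = 0
     val = 0
     if (me == root) val = 42
     call MPI_Bcast(val, 1, MPI_INTEGER, root, MPI_COMM_WORLD, ierr)
     ! if the tasks disagreed on 'root' here (e.g. 0 vs 80), the call would hang

     call MPI_Finalize(ierr)
   end program bcast_root_demo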

@DusanJovic-NOAA
Collaborator

> @DusanJovic-NOAA Do you know/remember what happens with the MPI root PE used in a nested situation? It seems as though there is a conflict with the MPI root PE when nesting is turned on. The MPI_bcast calls as part of this PR are trying to use the mpi_root value given to physics (which is set from FV3's mpp_root_pe() function) rather than a hard-coded 0 value, which is used in various places throughout the UFSATM code. I'm wondering if this is causing the conflict/hanging behavior noticed in this PR?
>
> Any insights are appreciated!

Root PE on the nested domain is the rank of the first MPI task that runs the nest. For example, if the parent domain runs on 80 tasks and the nest runs on 60 tasks:

the first 80 tasks, ranks 0,...,79, will have root PE = 0, and
the next 60 tasks, ranks 80,...,139, will have root PE = 80.

@DusanJovic-NOAA
Collaborator

I think I see where the problem is. In atmos_model.F90 we currently have:

!--- setup Init_parm
   Init_parm%me              =  mpp_pe()
   Init_parm%master          =  mpp_root_pe()
   Init_parm%fcst_mpi_comm   =  fcst_mpi_comm
   Init_parm%fcst_ntasks     =  fcst_ntasks
   ....

fcst_mpi_comm is the communicator that includes all tasks for all domains, the entire 'fcst grid component'. It is initialized from the fcst component's ESMF VM. If this communicator is actually used in ccpp-physics for collective calls, then all tasks need the same value for the 'root' task. But in this case, mpp_root_pe() is different for each domain. I think this is because it's an FMS routine used by fv3, and fv3 has different ways of identifying which domain (grid) a given task belongs to.

I think the solution is either:

  1. Set Init_parm%master to 0 so that all tasks in fcst_mpi_comm have the same 'root'
    or:
  2. Split the communicator into two (or more) communicators, one for each domain (grid), and adjust me and master to be relative to those new communicators.

I'm not sure about the rules for how ccpp treats multiple domains, or if it lacks a concept of multiple domains and treats all point columns uniformly, just as a list/array of columns.

This hasn't been an issue so far, probably because ccpp-physics hardcodes '0' for the root of MPI collective calls. Basically solution 1) reverts to that behavior.
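
For illustration, a rough sketch of option 2 (hypothetical; 'my_domain_id' is a placeholder for however the host would distinguish parent and nest tasks, and the actual change may look different):

   ! Sketch only: split fcst_mpi_comm into one communicator per domain and
   ! make me/master relative to it.
   integer :: domain_comm, domain_me, ierr

   call MPI_Comm_split(fcst_mpi_comm, my_domain_id, 0, domain_comm, ierr)
   call MPI_Comm_rank(domain_comm, domain_me, ierr)

   Init_parm%fcst_mpi_comm = domain_comm   ! per-domain communicator
   Init_parm%me            = domain_me     ! rank within that communicator
   Init_parm%master        = 0             ! root is rank 0 of each domain communicator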

@climbfuji
Collaborator

@DusanJovic-NOAA I didn't follow this closely, but regarding the CCPP question: This depends on how the multiple domains are implemented. Is there a separate "copy" (instance) of CCPP for the global domain (the six tiles) and the nested domain (one tile)? Or are the grid points of the nested domain simply appended to those of the global domain?

@DusanJovic-NOAA
Collaborator

@climbfuji I'm not sure I fully understand what you mean by 'a separate "copy" (instance) of CCPP for the global domain (the six tiles) and the nested domain (one tile)'. If I understand correctly how this works, CCPP runs over a list of columns on each MPI task. Each task runs its own copy/instance of CCPP. Does CCPP 'know' which (geographical) domain that list of points belongs to? I do not think it does. If it does 'know,' how does it know? The notion of domains, and the potential need for ccpp to 'know' this, is relevant if, for example, collective calls should only communicate with other (MPI) tasks belonging to the same domain, such as when computing a per-domain average, minimum, or maximum. I don't know if this is or ever will be needed in ccpp, but in that case, per-domain communicators would be required, I think.

@grantfirl
Collaborator Author

@DusanJovic-NOAA @climbfuji I also had the idea of setting Init_parm%master to 0 as a test, and I can confirm that it does indeed work (model no longer hangs). I'm just not sure that it is the preferred solution. Dom, if a different host model handles MPI communicators and roots differently, as long as all ccpp_bcast calls (and any remaining straight mpi_bcast calls) are set up to use whatever communicator/root that they're given, instead of hard-coded, is that OK from your perspective?
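
For reference, the test change described above would amount to something like the following against the atmos_model.F90 snippet quoted earlier (a sketch of option 1 only, not necessarily the final fix):

   !--- setup Init_parm (sketch of option 1; other fields unchanged)
   Init_parm%me     = mpp_pe()
   Init_parm%master = 0   ! communicator-wide root instead of mpp_root_pe()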

@climbfuji
Collaborator

> @DusanJovic-NOAA @climbfuji I also had the idea of setting Init_parm%master to 0 as a test, and I can confirm that it does indeed work (model no longer hangs). I'm just not sure that it is the preferred solution. Dom, if a different host model handles MPI communicators and roots differently, as long as all ccpp_bcast calls (and any remaining straight mpi_bcast calls) are set up to use whatever communicator/root that they're given, instead of hard-coded, is that OK from your perspective?

Yes, absolutely. I am pretty sure that's how it works in NEPTUNE. We have a specific MPI communicator with a specific MPI root and CCPP should (and hopefully does) use those exclusively. Never MPI_COMM_WORLD or 0 by default.

@grantfirl
Collaborator Author

> @grantfirl It looks like the CCPP and UFSATM hashes in your WM branch do not include the changes in your sub-PRs. It looks like something might have gone wrong with the hash update/sync. Thanks for addressing the UFSATM requested change!

@gspetro-NOAA Oops, so sorry. Should be fixed now.

@FernandoAndrade-NOAA added the "In Testing" and "ursa-ORT" labels on May 7, 2026
@gspetro-NOAA moved this from Review to Schedule in PRs to Process on May 7, 2026
@gspetro-NOAA
Collaborator

@grantfirl On Hercules, I have 5 tests failing for the same reason, with a RUN DID NOT COMPLETE status:
Failed Tests:

  • TEST rap_restart_gnu: FAILED: RUN DID NOT COMPLETE
    -- LOG: /work2/noaa/epic/gpetro/hercules/RTs/ufs-wm/3190/tests/logs/log_hercules/run_rap_restart_gnu.log
  • TEST conus13km_decomp_gnu: FAILED: RUN DID NOT COMPLETE
    -- LOG: /work2/noaa/epic/gpetro/hercules/RTs/ufs-wm/3190/tests/logs/log_hercules/run_conus13km_decomp_gnu.log
  • TEST rap_control_dyn64_phy32_gnu: FAILED: RUN DID NOT COMPLETE
    -- LOG: /work2/noaa/epic/gpetro/hercules/RTs/ufs-wm/3190/tests/logs/log_hercules/run_rap_control_dyn64_phy32_gnu.log
  • TEST conus13km_debug_decomp_gnu: FAILED: RUN DID NOT COMPLETE
    -- LOG: /work2/noaa/epic/gpetro/hercules/RTs/ufs-wm/3190/tests/logs/log_hercules/run_conus13km_debug_decomp_gnu.log
  • TEST conus13km_radar_tten_debug_gnu: FAILED: RUN DID NOT COMPLETE
    -- LOG: /work2/noaa/epic/gpetro/hercules/RTs/ufs-wm/3190/tests/logs/log_hercules/run_conus13km_radar_tten_debug_gnu.log

In the logs, the first error is:

  0: [2026-05-07T15:47:34.611] error: Unable to create TMPDIR [/local/scratch/gpetro/8865457]: Permission denied
  0: [2026-05-07T15:47:34.611] error: Setting TMPDIR to /tmp
 50: [hercules-06-25:1238572] mca_base_component_repository_open: unable to open mca_pmix_s1: libslurm_pmi.so: cannot open shared object file: No such file or directory (ignored)

Follow-on errors look like:

 34: A call to mkdir was unable to create the desired directory:
 34:
 34:   Directory: /local/scratch/gpetro/8865457
 34:   Error:     Permission denied
 34:
 34: Please check to ensure you have adequate permissions to perform
 34: the desired operation.
 34: --------------------------------------------------------------------------
 34: [hercules-06-25:1238556] [[18097,0],34] ORTE_ERROR_LOG: Error in file ../../orte/util/session_dir.c at line 107
 34: [hercules-06-25:1238556] [[18097,0],34] ORTE_ERROR_LOG: Error in file ../../orte/util/session_dir.c at line 346
 34: [hercules-06-25:1238556] [[18097,0],34] ORTE_ERROR_LOG: Error in file ../../../../../orte/mca/ess/pmi/ess_pmi_module.c at line 487
 31: --------------------------------------------------------------------------
 31: It looks like orte_init failed for some reason; your parallel process is
 31: likely to abort.  There are many reasons that a parallel process can
 31: fail during orte_init; some of which are due to configuration or
 31: environment problems.  This failure appears to be an internal failure;
 31: here's some additional information (which may only be relevant to an
 31: Open MPI developer):
 34:
 34:   orte_session_dir failed
 34:   --> Returned value Error (-1) instead of ORTE_SUCCESS
 34: --------------------------------------------------------------------------
 31: --------------------------------------------------------------------------

I am going to reclone and run just those 5 tests to see if something was wonky with the clone, because I've never seen an initial error like that before. Thought I'd post though in case it triggers your memory about anything.
Orion passed because it doesn't run GNU.
Derecho has some failing tests, but they are all Intel and probably a different issue. They might pass on rerun. The Hercules failures looked more concerning.
Hercules run_dir: /work2/noaa/epic/gpetro/hercules/RTs/ufs-wm/stmp/gpetro/FV3_RT/rt_292671

@BrianCurtis-NOAA
Collaborator

@gspetro-NOAA could it be a space issue?

@grantfirl
Collaborator Author

@gspetro-NOAA I have never come across those kinds of errors in this PR or elsewhere. It sure seems machine-related and not code-related to me.

@gspetro-NOAA
Collaborator

> @gspetro-NOAA I have never come across those kinds of errors in this PR or elsewhere. It sure seems machine-related and not code-related to me.

Happily, it looks like it was a random hiccup! All the failing Hercules tests passed on rerun. 🤷‍♀️

@gspetro-NOAA
Collaborator

Testing has completed successfully on all platforms. Note that the failing CI is expected: Repo_check is failing due to the merge of an LM4 PR outside of the PR process, and there is an increase of one remark on several tests due to a new "ifort: remark #10448", which will go away with the switch to LLVM.
Leaving a note to merge in sub-PRs.

@gspetro-NOAA gspetro-NOAA merged commit 5eaf814 into ufs-community:develop May 11, 2026
8 of 10 checks passed
